Dataset Cleaning

Starting Anew

By default, a Dataset will attempt to reinitialise at launch. In short, this means that it looks for a file that looks like itself. If it finds such a file, it will recreate itself from it.

For the user, this means your workflow is robust against restarts and data loss. However, it also means that datasets act like “accumulators”, constantly gaining attributes and runners as you test.

There can come a point where you realise that something you set earlier could be causing problems (or simply isn’t needed). This tutorial runs through some methods of dealing with these situations.

Let’s start with the most basic case, skip:

[1]:
from remotemanager import Dataset

def f(inp):
    return inp

ds = Dataset(f,
             skip=False  # new option!
            )

When set to False, the skip argument forces the Dataset to start anew, so any variables that were stored are lost. The same can be achieved by deleting the database file (in fact, this is what happens internally, after which a new file is created). The filename takes the form name-dataset-uuid.yaml; if no name is set, the name part is omitted.

Deleting this file has the same effect as skip=False, though only once: a new file is created along with the dataset.
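To make the restore-or-restart behaviour concrete, here is a minimal sketch using a toy class. TinyStore, its JSON file and its add method are hypothetical stand-ins invented for this illustration; this is not how remotemanager implements Dataset, only the same idea in miniature.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy self-restoring object (a stand-in, not remotemanager's Dataset)."""

    def __init__(self, path, skip=True):
        self.path = path
        self.items = []
        if skip and os.path.exists(path):
            # A matching file exists: rebuild our state from it.
            with open(path) as fh:
                self.items = json.load(fh)
        else:
            # Start anew, replacing whatever was stored before.
            self._save()

    def add(self, item):
        self.items.append(item)
        self._save()

    def _save(self):
        with open(self.path, "w") as fh:
            json.dump(self.items, fh)


path = os.path.join(tempfile.mkdtemp(), "tinystore.json")

first = TinyStore(path)
first.add("runner-0")

restored = TinyStore(path)           # finds the file and restores from it
print(restored.items)                # ['runner-0']

fresh = TinyStore(path, skip=False)  # ignores the file and starts again
print(fresh.items)                   # []
fresh.add("runner-1")

os.remove(path)                      # deleting the file by hand ...
after_delete = TinyStore(path)       # ... also gives an empty start
print(after_delete.items)            # []
```

Note that the file exists again after the last step, so the next plain TinyStore(path) would restore from it as usual.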

The database filename can be seen using Dataset.dbfile:

[2]:
ds.dbfile
[2]:
'dataset-9ebf1589.yaml'
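The naming rule described above can be sketched as a small helper. expected_dbfile is a hypothetical function written for this illustration (it is not part of remotemanager), and "mytest" is a made-up dataset name; the uuid part is assumed to be the short, 8-character form seen in the output above.

```python
def expected_dbfile(short_uuid, name=None):
    """Build a filename of the form name-dataset-uuid.yaml.

    The name part is left out when no name is given.
    """
    parts = ["dataset", short_uuid]
    if name:
        parts.insert(0, name)
    return "-".join(parts) + ".yaml"


print(expected_dbfile("9ebf1589"))                 # dataset-9ebf1589.yaml
print(expected_dbfile("9ebf1589", name="mytest"))  # mytest-dataset-9ebf1589.yaml
```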

You can also force this value to be whatever you want, but only at the dataset initialisation:

[3]:
ds_unique = Dataset(f,
                    dbfile = 'set_a_name_here',  # new option!
                    )

print(f'the database file for this dataset is now {ds_unique.dbfile}')
the database file for this dataset is now set_a_name_here.yaml

Now whenever you initialise a dataset with this filename, it will attempt to connect with that file.
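The output above shows that the .yaml extension was added to the name we passed. Here is a minimal sketch of that normalisation. with_yaml_suffix is a hypothetical helper, not part of remotemanager, and leaving a name that already ends in .yaml untouched is an assumption, since this tutorial does not show that case.

```python
def with_yaml_suffix(filename):
    """Append '.yaml' unless the name already ends with it (assumed behaviour)."""
    if filename.endswith(".yaml"):
        return filename
    return filename + ".yaml"


print(with_yaml_suffix("set_a_name_here"))  # set_a_name_here.yaml
```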

Finer options

This is all well and good, but what if you don’t want to blow up your dataset and start again? For example, you know that one of your runners is causing issues and needs to be removed. Well, there are options for this too.

Let’s append some runs and experiment with removing them.

Let’s also create a function to show us some information about our runners.

[4]:
for run in range(7):
    ds.append_run(args={'inp': run})

def print_runs():
    for r_id, runner in ds.runner_dict.items():
        print(f'{r_id}: {runner.short_uuid} | {runner.args}')
appended run runner-0
appended run runner-1
appended run runner-2
appended run runner-3
appended run runner-4
appended run runner-5
appended run runner-6
[5]:
print_runs()
runner-0: 9e62f0bc | {'inp': 0}
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-3: 1fc9add9 | {'inp': 3}
runner-4: 6d5e0646 | {'inp': 4}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}

Now we can look at all the ways of removing a run. We do this with ds.remove_run(id). Here, id is a “smart” value that can be an int, str or dict; the function behaves slightly differently depending on the input type:

  • An int will be treated like a list index, and the runner at that index will be removed.

  • A dict will be treated like arguments, and the runner with those args will be searched for.

  • A str is first checked against the runner names, and is then checked against the uuid of each runner.

    • Short and long uuids can be used (8 or 64 characters).
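The three lookup rules in this list can be sketched with a toy function. find_runner and the dictionaries it searches are hypothetical, written only to illustrate the dispatch on input type; the uuids are made-up placeholders and this is not remotemanager's implementation.

```python
def find_runner(runners, ident):
    """Return the runner matching ident, or None (toy model only)."""
    if isinstance(ident, int):
        # int: treat as a position in the runner list
        try:
            return runners[ident]
        except IndexError:
            return None
    if isinstance(ident, dict):
        # dict: compare against the args stored by each runner
        return next((r for r in runners if r["args"] == ident), None)
    if isinstance(ident, str):
        # str: check names first, then long (64) or short (8) uuids
        for r in runners:
            if r["name"] == ident:
                return r
        for r in runners:
            if ident in (r["uuid"], r["uuid"][:8]):
                return r
    return None


# Placeholder runners with made-up 64-character uuids
runners = [
    {"name": "runner-0", "uuid": "a" * 64, "args": {"inp": 0}},
    {"name": "runner-1", "uuid": "b" * 64, "args": {"inp": 1}},
    {"name": "runner-2", "uuid": "c" * 64, "args": {"inp": 2}},
]

print(find_runner(runners, 1)["name"])           # runner-1
print(find_runner(runners, {"inp": 2})["name"])  # runner-2
print(find_runner(runners, "runner-0")["name"])  # runner-0
print(find_runner(runners, "b" * 8)["name"])     # runner-1
print(find_runner(runners, 7))                   # None
```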

Runner Removal

Firstly, if you know the index of the runner within ds.runners, you can pass that index:

[6]:
ds.remove_run(0)
print_runs()
removed runner dataset-9ebf1589-runner-0
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-3: 1fc9add9 | {'inp': 3}
runner-4: 6d5e0646 | {'inp': 4}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}

Runner 0 has disappeared!

Note

This function always removes the runner at that index. So in this case, if we call again with index 0, runner-1 would be removed, as it is now the first.
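The same shifting can be seen with a plain Python list, used here purely as an analogy for the runner list (no remotemanager objects are involved):

```python
names = ["runner-0", "runner-1", "runner-2"]

names.pop(0)   # removes runner-0; runner-1 now sits at index 0
names.pop(0)   # the same index now removes runner-1

print(names)   # ['runner-2']
```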

Next is the uuid. This can be found by printing the uuid of a runner you have access to:

[7]:
r_uuid = ds.runners[2].uuid

print(r_uuid)

r_short_uuid = ds.runners[3].short_uuid

print(r_short_uuid)
1fc9add953e10b337317695f16173d23ce790b3d03d78192fafb55b9e6bc51e7
6d5e0646
[8]:
ds.remove_run(r_uuid)
ds.remove_run(r_short_uuid)
print_runs()
removed runner dataset-9ebf1589-runner-3
removed runner dataset-9ebf1589-runner-4
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}

We grabbed the uuids of the runners at indices 2 and 3, which are runner-3 and runner-4 (as we removed runner-0). These two have also disappeared.

If you don’t know the index of the runner and don’t have its uuid stored, you can remove by args. This matches the passed args against those stored by each runner and removes the runner that matches. This is arguably the most flexible (and useful) method, though it is less efficient than the other approaches.

For example, if you appended a run with {'inp': 6}, you can remove that runner by calling:

[9]:
ds.remove_run({'inp': 6})
print_runs()
removed runner dataset-9ebf1589-runner-6
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-5: 552e16a0 | {'inp': 5}

Looks like runner 6 (which had inp: 6) has also gone.

Finally, removing via index may be confusing if runs have already been removed (i.e. you don’t have a continuous, zero-indexed list). In that case, you can remove by the actual id by passing remove_run("runner-{n}"):

[10]:
ds.remove_run('runner-5')
print_runs()
removed runner dataset-9ebf1589-runner-5
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}

Leaving us with just runners 1 and 2.

This function also returns True or False depending on whether it removed a runner:

[11]:
print('removed runner-2?:', ds.remove_run(1))
print('removed runner-3?:', ds.remove_run(2))
print('\nfinal runner list:')
print_runs()
removed runner dataset-9ebf1589-runner-2
removed runner-2?: True
removed runner-3?: False

final runner list:
runner-1: d3c4ab40 | {'inp': 1}
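The True/False return value makes it easy to report removals that matched nothing. Below is a minimal sketch: remove_all is a hypothetical helper, and StandInDataset is a made-up stand-in that mimics only the return value of remove_run so that the example runs without remotemanager. With a real Dataset you would pass ds in its place.

```python
class StandInDataset:
    """Hypothetical stand-in mimicking only remove_run's True/False result."""

    def __init__(self, stored_args):
        self._stored = list(stored_args)

    def remove_run(self, args):
        if args in self._stored:
            self._stored.remove(args)
            return True
        return False


def remove_all(dataset, arg_dicts):
    """Try to remove a run for each args dict; return those that matched nothing."""
    return [args for args in arg_dicts if not dataset.remove_run(args)]


demo = StandInDataset([{"inp": 1}, {"inp": 2}])
missed = remove_all(demo, [{"inp": 2}, {"inp": 9}])
print(missed)  # [{'inp': 9}]
```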

Clearing Runners

There is one additional option for removing runners, and that’s wipe_runs. This removes all runs from the dataset:

Note

We’re using confirm=False here to allow the notebook to be tested, but be aware that this will skip the confirmation dialog.

[12]:
ds.wipe_runs(confirm=False)
print(ds.runners)
[]

Persistence

All these changes are, of course, saved to the database when performed, so be careful when using them. If we simulate restarting this notebook (or opening a different notebook that also uses this dataset), we will see no runners:

[13]:
ds = Dataset(f)
print(ds.runners)

ds.append_run({'inp': 2})
print(ds.runners)
[]
appended run runner-0
[dataset-9ebf1589-runner-0]

Re-adding Runners

Adding runners back to a dataset that has “holes” within its runner storage will cause no harm. Runners will be added to fill any missing spaces then continue as normal after that:

[14]:
for run in range(10):
    ds.append_run({'inp': run})
appended run runner-1
appended run runner-2
runner runner-0 already exists
appended run runner-3
appended run runner-4
appended run runner-5
appended run runner-6
appended run runner-7
appended run runner-8
appended run runner-9
[15]:
print_runs()
runner-0: 1628650a | {'inp': 2}
runner-1: 9e62f0bc | {'inp': 0}
runner-2: d3c4ab40 | {'inp': 1}
runner-3: 1fc9add9 | {'inp': 3}
runner-4: 6d5e0646 | {'inp': 4}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}
runner-7: df3419b5 | {'inp': 7}
runner-8: 8b0ff3e5 | {'inp': 8}
runner-9: ca823715 | {'inp': 9}

Note here how we now have 10 runners, as expected. The runner with inp: 2 has been skipped, as it already exists.
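In the outputs in this tutorial, the same args always produce the same short uuid (for example, inp: 0 gives 9e62f0bc both before and after the wipe), which suggests the uuid is derived from the run's contents. Here is a sketch of how such duplicate detection could work, assuming a SHA-256 digest over a JSON encoding of the args. This is an assumption for illustration: remotemanager's actual hashing scheme may differ, so the digests below will not match the uuids shown above. args_digest is a hypothetical helper.

```python
import hashlib
import json


def args_digest(args):
    """Deterministic 64-character hex digest of an args dict (illustrative)."""
    encoded = json.dumps(args, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


seen = {}
for inp in [2, 0, 1, 2]:
    args = {"inp": inp}
    digest = args_digest(args)
    if digest in seen:
        # Same args give the same digest, so the duplicate is skipped
        print(f"run with {args} already exists")
        continue
    seen[digest] = args

print(len(seen))  # 3
```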

Run Args

Now that you know how to remove runners, what about run args? If you want to update a value, in most cases you can simply overwrite it. If a run argument is causing issues, you can also delete it with the usual Python syntax.

Note

Starting in version 0.10.0, run_args are no longer accessible at the dataset level (i.e. ds.mpi), and must be accessed via the run_args property.

[16]:
ds.set_run_arg("mpi", 16)

print(ds.run_args["mpi"])
16
[17]:
print(ds.global_run_args)

del ds.run_args["mpi"]

print(ds.global_run_args)
{'skip': True, 'force': False, 'asynchronous': True, 'local_dir': 'temp_runner_local', 'remote_dir': 'temp_runner_remote', 'mpi': 16}
{'skip': True, 'force': False, 'asynchronous': True, 'local_dir': 'temp_runner_local', 'remote_dir': 'temp_runner_remote'}
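For reference, the “usual Python syntax” is shown here on a plain dict, used as a stand-in for run args. Whether ds.run_args supports every dict method (such as pop) is not shown in this tutorial, so treat that part as an assumption if you apply it to a real dataset.

```python
run_args = {"skip": True, "force": False}

run_args["mpi"] = 16   # set a value
run_args["mpi"] = 32   # update by overwriting
del run_args["mpi"]    # delete; raises KeyError if the key is missing

# pop with a default removes the key if present and never raises
removed = run_args.pop("mpi", None)

print(run_args)  # {'skip': True, 'force': False}
print(removed)   # None
```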

Cleaning Directories

Added in version 0.5.9.

Too much clutter from testing? Dataset has some functions which help with deleting unwanted data:

  • dataset.wipe_local() will attempt to delete any local directories

  • dataset.wipe_remote() will attempt to delete any remote and run directories

If you really want to reset, dataset also provides a dataset.hard_reset() function, which will do all of the above, delete the database file and then clear any runners. This essentially gives you a like-new dataset.
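One way to keep these three levels of cleanup straight is a small wrapper. The sketch below is hypothetical: clean and its level names are invented for this illustration, and RecordingDataset is a made-up stand-in that only records which method was called, so the example runs without remotemanager. It assumes the three functions can be called without arguments, as written above.

```python
class RecordingDataset:
    """Hypothetical stand-in that records which cleanup method was called."""

    def __init__(self):
        self.calls = []

    def wipe_local(self):
        self.calls.append("wipe_local")

    def wipe_remote(self):
        self.calls.append("wipe_remote")

    def hard_reset(self):
        self.calls.append("hard_reset")


def clean(dataset, level="local"):
    """Run one cleanup step, chosen by level: 'local', 'remote' or 'all'."""
    if level == "local":
        dataset.wipe_local()
    elif level == "remote":
        dataset.wipe_remote()
    elif level == "all":
        # hard_reset already covers both wipes, so it is called on its own
        dataset.hard_reset()
    else:
        raise ValueError(f"unknown cleanup level: {level!r}")


demo = RecordingDataset()
clean(demo, "local")
clean(demo, "all")
print(demo.calls)  # ['wipe_local', 'hard_reset']
```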